Predication-based Semantic Indexing: Permutations as a Means to Encode Predications in Semantic Space
Abstract
Corpus-derived distributional models of semantic distance between terms have proved useful in a number of applications. For both theoretical and practical reasons, it is desirable to extend these models to encode discrete concepts and the ways in which they are related to one another. In this paper, we present a novel vector space model that encodes semantic predications derived from MEDLINE by the SemRep system into a compact spatial representation. The associations captured by this method are of a different and complementary nature to those derived by traditional vector space models, and the encoding of predication types presents new possibilities for knowledge discovery and information retrieval.

Introduction

The biomedical literature contains vast amounts of knowledge that could inform our understanding of human health and disease. Much of this literature is available as electronic text, presenting an opportunity for the development of automated methods to extract and encode knowledge in computer-interpretable form. Distributional models of language are able to extract meaningful estimates of the semantic relatedness between terms from unannotated free text. These models have proved useful in a variety of biomedical applications (for a review see (1)), and include recent variants that scale comfortably to large biomedical corpora such as the MEDLINE corpus of abstracts (2). However, the semantic relatedness estimated by most distributional models is of a general nature. These models do not encode the type of relationship that exists between terms, which limits their ability to support logical inference. Furthermore, while distributional models such as Latent Semantic Analysis (LSA) simulate human performance in many cognitive tasks (3), they do not represent the object-relation-object triplets (or propositions) that are considered to be the atomic unit of thought in cognitive theories of comprehension (4). In this paper we address these issues by defining Predication-based Semantic Indexing (PSI), a novel distributional model of language that encodes semantic predications derived from MEDLINE by the SemRep system (5) into a compact vector space representation. Associations captured by PSI complement those captured by existing models, and present new possibilities for knowledge discovery and information retrieval.

Background

Many existing distributional models draw their estimates of semantic relatedness from co-occurrence statistics within a defined context, such as a sliding window or an entire document (1). While recent models (reviewed in (6)) instead define context in terms of the grammatical relationships produced by a parser, these models do not encode the nature of the grammatical relationship in a retrievable manner. The emergence of distributional models that incorporate word order (7), (8) has shown that it is possible to encode and retrieve additional information within a vector space. These models achieve this end by using either convolution products (7) or permutations of sparse random vectors (8) to transform the vectors used to represent terms into new representations that are close to orthogonal to the original vectors. Consequently, there is very little overlap in the information they carry, and additional information related to term position can be encoded. These transformations are reversible, which facilitates retrieval of this information.
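The close-to-orthogonality and reversibility that these models rely on are easy to verify directly. The following sketch (an illustration in Python with numpy; the dimensionality and sparsity are toy values rather than those of any published system) builds sparse ternary random vectors and confirms that independently drawn vectors, and permuted copies, are close to orthogonal, while the permutation itself can be undone exactly:

import numpy as np

rng = np.random.default_rng(0)

def elemental_vector(d=1000, nnz=10):
    # Sparse ternary random vector: mostly zeros, a few +1/-1 entries.
    v = np.zeros(d)
    positions = rng.choice(d, size=nnz, replace=False)
    v[positions] = rng.choice([1.0, -1.0], size=nnz)
    return v

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

v1, v2 = elemental_vector(), elemental_vector()
print(cosine(v1, v2))              # near 0: independently drawn sparse vectors rarely overlap
print(cosine(v1, np.roll(v1, 1)))  # near 0: a rotated copy is also close to orthogonal
print(cosine(v1, np.roll(np.roll(v1, 1), -1)))  # 1.0: the rotation reverses exactly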
PSI is based on the model of Sahlgren and his colleagues, which uses permutations as a means to encode word-order information (8), and which is in turn a variant of the Random Indexing (RI) approach to distributional semantics (9). This approach provides a simple and elegant solution to the problem of reversibly transforming term vectors, using permutations of the sparse random vectors that form the basis of RI. Sahlgren et al.'s approach is derived from sliding-window (or term-term) RI, which bases its vector representations for terms on their co-occurrence with other terms in a sliding window moved through the text. While the sliding-window approach is well established in distributional semantics, established methods either use the full term-by-term space or reduce the dimensionality of this space using the computationally demanding Singular Value Decomposition (SVD). RI is able to achieve this dimension reduction at a fraction of the computational cost of SVD by constructing semantic vectors for each term on the fly, without the need for a term-by-term matrix. Each term in the text corpus is assigned an elemental vector of dimensionality d (usually on the order of 1,000), the dimension of the reduced-dimensional semantic space within which the semantic relatedness of terms will be measured. These elemental vectors are sparse: they contain mostly zeros, with on the order of 10 non-zero values of either +1 or -1. As there are many possible permutations of these few non-zero values, elemental vectors tend to be close to orthogonal to one another: their relatedness, as measured with the commonly used cosine metric, tends toward zero. This approximates a full term-by-term matrix, but rather than assigning an orthogonal dimension to each term, RI assigns a close-to-orthogonal reduced-dimensional elemental vector. To encode additional information about word order, the elemental vector for a given term is permuted to produce a new vector, almost orthogonal to the vector from which it originated. Consider the low-dimensional approximations of elemental vectors below:

V1: [ 1 0 0 0 0 1 0 0 0 0 0 -1 0 0 0 ]
V2: [ 0 1 0 0 0 0 1 0 0 0 0 0 -1 0 0 ]

These two vectors are orthogonal to one another: as there is no common non-zero dimension between them, their cosine (or normalized dot product) is zero. V2 was derived from V1 by rotating every value one position to the right, and this transformation can be reversed by rotating every value in V2 one position to the left. This simple procedure is used by Sahlgren et al. to encode word-order information into a term-term based semantic space. The semantic vector for each term consists of the normalized linear sum of the permuted elemental vectors of every term with which it co-occurs, with the permutation encoding the relative position and direction of each co-occurring term in the sliding window. The reversible nature of this transformation facilitates order-based retrieval. For example, rotating all elements of the elemental vector for a term one position to the right generates a vector with high similarity to the semantic vectors of terms occurring one position to the left of it. Table I provides some examples of order-based retrieval in a permutation-based space created from the MEDLINE corpus of abstracts using the Semantic Vectors package, to which author TC is a contributor (10).

? cancer            streptococcus ?       ? cough
.81: breast         .71: pyogenes         .89: whooping
.78: colorectal     .71: agalactiae       .48: nonproductive
.74: prostate       .69: pyogens          .47: hacking
.69: antiprostate   .65: milleri          .44: brassy
.67: antibreast     .62: acidominimus     .42: barking

Table I: Order-based retrieval from MEDLINE. The "?" denotes the relative position of the target term.
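To make the sliding-window encoding behind Table I concrete, the following sketch (again illustrative Python with numpy: a three-bigram corpus, a window radius of one, and toy dimensionality, not the configuration used to build the space above) encodes relative position by rotation and then answers an order-based query:

import numpy as np

rng = np.random.default_rng(0)
D, NNZ = 500, 10

def elemental_vector(d=D, nnz=NNZ):
    v = np.zeros(d)
    positions = rng.choice(d, size=nnz, replace=False)
    v[positions] = rng.choice([1.0, -1.0], size=nnz)
    return v

def cosine(a, b):
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / norm if norm else 0.0

corpus = [["whooping", "cough"], ["breast", "cancer"], ["colorectal", "cancer"]]
vocab = {term for doc in corpus for term in doc}
elemental = {term: elemental_vector() for term in vocab}
semantic = {term: np.zeros(D) for term in vocab}

# For each focus term, add the elemental vector of each term in its window,
# rotated by that term's signed offset from the focus position.
for doc in corpus:
    for i, focus in enumerate(doc):
        for j, other in enumerate(doc):
            if i != j and abs(i - j) <= 1:
                semantic[focus] += np.roll(elemental[other], j - i)

# Order-based query "? cough": rotating the elemental vector for "cough" one
# step to the right yields a probe resembling the semantic vectors of terms
# occurring one position to the left of "cough".
probe = np.roll(elemental["cough"], 1)
print(max(vocab, key=lambda term: cosine(probe, semantic[term])))  # -> whooping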
In this paper, we adapt Sahlgren et al.'s method of encoding word-order information into a vector space to encode semantic predications produced by the SemRep system (5), (11). SemRep combines general linguistic processing (a shallow categorical parser and an underspecified dependency grammar) with domain-specific knowledge resources: mappings from free text to the UMLS accomplished by the MetaMap software (12), the UMLS Metathesaurus and Semantic Network (13), and the SPECIALIST lexicon and lexical tools (14). SemRep uses these techniques to extract semantic predications from titles and abstracts in the MEDLINE database, as shown in this example drawn from (5). Given the excerpt "... anti-inflammatory drugs that have clinical efficacy in the management of asthma, ...", SemRep extracts the following semantic predication between UMLS concepts: "Anti-Inflammatory Agents TREATS Asthma". We present in this paper a description of the theoretical and methodological basis of PSI, and include examples of the sorts of information the model encodes and retrieves, discussed in the context of possible applications.

Methods

We derived a PSI space from a database of semantic predications extracted by SemRep from MEDLINE citations dated between 2003 and September 9, 2008. In total, SemRep extracted 13,562,350 predications from 2,634,406 citations. Of these, predications involving negation (such as "DOES NOT TREAT") were excluded, leaving 13,380,712 predications, which were encoded into the PSI space.

We encode this predication information using permutation-based RI. Rather than assigning elemental vectors to each term as in Sahlgren et al.'s model, we assign sparse elemental vectors (d=500) to each UMLS concept contained in the predications database. We then assign a unique number to each of the included predication types (such as "TREATS"). We create semantic vectors (d=500) for each UMLS concept in the database, and every time a given UMLS concept occurs in a predication, we add to its semantic vector the elemental vector of the other concept in the predication. This elemental vector is permuted according to the type of the predication. For example, for the predication "Anti-Inflammatory Agents TREATS Asthma" we would add the elemental vector for Asthma to the semantic vector for Anti-Inflammatory Agents, but rotate every element of this vector 39 steps (the number assigned to the predicate "TREATS") to the left. Conversely, we would add to the semantic vector for Asthma the elemental vector for Anti-Inflammatory Agents rotated 39 steps to the right. In this way we are able to encode the type of predication that exists between these concepts.

We also construct a general distributional model of the UMLS concepts in the predications database using the Reflective Random Indexing (RRI) model (15), by creating document vectors for each unique PubMed ID in the database. Document vectors are created based on the terms contained in these citations: elemental vectors are assigned to each term, and document vectors are constructed as the normalized linear sum of the elemental vectors of the terms they contain. Rather than using raw term frequency, we employ the log-entropy weighting scheme, which has been shown to enhance document representations in several applications (3). A vector for each concept is then constructed as the frequency-weighted, normalized linear sum of the vectors of the documents in which it occurs.
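As a concrete illustration of the predication-encoding step described above, consider the following sketch (illustrative Python with numpy; the dimensionality matches the d=500 used here, but the single predicate number and two-concept vocabulary are toy stand-ins for the SemRep predicate table and the UMLS concepts, and the function names are our own):

import numpy as np

rng = np.random.default_rng(0)
D = 500

def elemental_vector(d=D, nnz=10):
    v = np.zeros(d)
    positions = rng.choice(d, size=nnz, replace=False)
    v[positions] = rng.choice([1.0, -1.0], size=nnz)
    return v

predicate_number = {"TREATS": 39}  # each predication type gets a unique number

concepts = ["Anti-Inflammatory Agents", "Asthma"]
elemental = {c: elemental_vector() for c in concepts}
semantic = {c: np.zeros(D) for c in concepts}

def encode(subject, predicate, obj):
    # Encode one predication into the semantic vectors of both concepts.
    steps = predicate_number[predicate]
    # Subject side: add the object's elemental vector, rotated `steps` to the left.
    semantic[subject] += np.roll(elemental[obj], -steps)
    # Object side: add the subject's elemental vector, rotated `steps` to the right.
    semantic[obj] += np.roll(elemental[subject], steps)

encode("Anti-Inflammatory Agents", "TREATS", "Asthma")

# Because the rotation is reversible, the probe for "? TREATS Asthma" is the
# elemental vector for Asthma rotated 39 steps to the left; it now has high
# similarity to the semantic vector for Anti-Inflammatory Agents.
probe = np.roll(elemental["Asthma"], -predicate_number["TREATS"])
print(probe @ semantic["Anti-Inflammatory Agents"])  # matches: cosine is 1.0 here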
PSI requires a modification of the conventional nearest-neighbor approach, as we are interested in the strongest association between concepts across all predications. In the modified semantic network used by SemRep (16), there are 40 permitted predications between concepts once negations (e.g., exercise DOES NOT TREAT hiv) are excluded. Semantic distance in PSI is therefore measured by extracting all permutations of a concept's vector and comparing the second concept against these to find the predication with the strongest association. For elemental vectors, we employ a sparse representation used in our previous work (2), which stores the dimension and sign of each of the 20 non-zero values. This allows rapid generation of all possible permutations by incrementing the value that represents the index of each non-zero element. To speed up this process in the EpiphaNet example (Figure 1), we extract the 500 nearest neighbors of a cue concept from the general distributional space (which should subsume the predication-based space: every concept in a predication must co-occur in a citation with the other concept concerned). We then perform predication-based nearest-neighbor search over these neighbors only.

Results and Discussion

Predication-based retrieval

In a manner analogous to the order-based retrieval illustrated previously, it is possible to perform predication-based retrieval, using permutations to determine which UMLS concept the model has encoded with a strong association to another concept in a particular predication relationship. Table II illustrates predication-based retrieval. For example, the query "? TREATS Asthma" retrieves concepts for asthma treatments (sb240563, also known as Mepolizumab, has recently been shown to reduce exacerbations in asthma (17)).

? TREATS Asthma                   Metronidazole TREATS ?
1:   cetirizine pseudoephedrine   .57: chronic intestinal amebiasis
1:   norisodrine                  .36: urogenital trichomonas
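The modified nearest-neighbor search can be sketched by continuing the encoding example above (it reuses the elemental, semantic, and predicate_number structures from that sketch; only the single toy TREATS predicate is numbered, whereas the real search ranges over all 40 permitted predication types and, in EpiphaNet, over the 500 pre-selected neighbors):

import numpy as np

def cosine(a, b):
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / norm if norm else 0.0

def strongest_predication(cue, candidate):
    # Compare the candidate's semantic vector against every permutation of the
    # cue's elemental vector, in both directions, keeping the best match.
    best_score, best_reading = 0.0, None
    for name, steps in predicate_number.items():
        # A leftward rotation matches predications with the candidate as subject;
        # a rightward rotation matches those with the cue as subject.
        for shift, reading in ((-steps, f"{candidate} {name} {cue}"),
                               (steps, f"{cue} {name} {candidate}")):
            score = cosine(np.roll(elemental[cue], shift), semantic[candidate])
            if score > best_score:
                best_score, best_reading = score, reading
    return best_score, best_reading

print(strongest_predication("Asthma", "Anti-Inflammatory Agents"))
# -> (approximately 1.0, 'Anti-Inflammatory Agents TREATS Asthma')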
Similar articles
Logical Leaps and Quantum Connectives: Forging Paths through Predication Space
The Predication-based Semantic Indexing (PSI) approach encodes both symbolic and distributional information into a semantic space using a permutation-based variant of Random Indexing. In this paper, we develop and evaluate a computational model of abductive reasoning based on PSI. Using distributional information, we identify pairs of concepts that are likely to be predicated about a common thi...
Predication-based Semantic Indexing: Permutations as a Means to Encode Predications in Semantic Space
Corpus-derived distributional models of semantic distance between terms have proved useful in a number of applications. For both theoretical and practical reasons, it is desirable to extend these models to encode discrete concepts and the ways in which they are related to one another. In this paper, we present a novel vector space model that encodes semantic predications derived from MEDLINE by...
Using latent semantic analysis and the predication algorithm to improve extraction of meanings from a diagnostic corpus.
There is currently a widespread interest in indexing and extracting taxonomic information from large text collections. An example is the automatic categorization of informally written medical or psychological diagnoses, followed by the extraction of epidemiological information or even terms and structures needed to formulate guiding questions as a heuristic tool for helping doctors. Vector spa...
Reflections on Image Indexing: An Image Is Worth a Thousand Words
Purpose: This paper presents various image indexing techniques and discusses their advantages and limitations. Methodology: Conducting a review of the literature, it identifies three main image indexing techniques, namely concept-based image indexing, content-based image indexing, and folksonomy. It then describes each technique. Findings: Concept-based image indexing is te...
Embedding Probabilities in Predication Space with Hermitian Holographic Reduced Representations
Predication-based Semantic Indexing (PSI) is an approach to generating high-dimensional vector representations of concept-relation-concept triplets. In this paper, we develop a variant of PSI that accommodates estimation of the probability of encountering a particular predication (such as fluoxetine TREATS major depressive disorder) in a collection of predications concerning a concept of intere...